[Perf] Add new npu_fused_infer_attention_score op to improve performance in splitfuse cases and resolve long-seq mask problems #2962
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This PR mainly integrates the new splitfuse chunked prefill op. The changes touch two files, attention_v1.py and model_runner_v1.py. In attention_v1.py, the attention computation in _forward_v1_style switches from _npu_paged_attention_splitfuse to npu_fused_infer_attention_score. In model_runner_v1.py, the logic that builds the attention mask for the ChunkedPrefill case is modified.
My review found two serious issues:
- In attention_v1.py, the actual_seq_lengths argument passed to the new op is wrong: it uses cumulative token positions rather than sequence lengths, which leads to incorrect attention results.
- In model_runner_v1.py, the attention mask generated for ChunkedPrefill uses a hard-coded size of (2048, 2048), which makes the code fragile and breaks once any sequence exceeds 2048 tokens.
Please fix these two serious issues to ensure the code is correct and robust.
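For reference, below is a minimal sketch (with made-up lengths, not values from this PR) of the two tensors the review contrasts; whether actual_seq_lengths expects the cumulative offsets or the per-request lengths depends on the npu_fused_infer_attention_score contract for the "TND" layout.

```python
import torch

# Hypothetical per-request token counts for a batch of three requests.
query_lens = torch.tensor([3, 5, 2], dtype=torch.int32)

# query_start_loc is the exclusive prefix sum of query_lens, i.e. cumulative
# token offsets into the flattened batch: [0, 3, 8, 10].
query_start_loc = torch.zeros(query_lens.numel() + 1, dtype=torch.int32)
query_start_loc[1:] = torch.cumsum(query_lens, dim=0)

print(query_start_loc[1:])  # cumulative end offsets: 3, 8, 10
print(query_lens)           # per-request lengths:    3, 5, 2
```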
num_kv_heads=self.num_kv_heads,
input_layout="TND",
block_size=block_size,
actual_seq_lengths=attn_metadata.query_start_loc[1:],
Please consider using attn_metadata.query_lens.
if attn_state == AscendAttentionState.ChunkedPrefill and not self.vllm_config.model_config.use_mla:
return self.attn_mask_builder.get_splitfuse_attn_mask(
seq_lens, position, self.dtype, self.device)
return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)
The attention mask uses a hard-coded size of (2048, 2048). This is a magic number and makes the implementation less robust. If any sequence in the batch is longer than 2048, it produces an incorrect mask or an out-of-bounds error. The mask size should be derived from the model configuration's maximum sequence length to ensure correctness and avoid magic numbers.
Suggested change:
- return torch.triu(torch.ones(2048, 2048), diagonal=1).to(torch.int8)
+ return torch.triu(torch.ones(self.model_config.max_model_len, self.model_config.max_model_len), diagonal=1).to(torch.int8)
This is because the new op supports a compressed mask.
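As I understand it (a sketch of the assumed semantics, not the op's documented contract), the compressed mask is a fixed 2048x2048 int8 upper-triangular pattern that the op reuses regardless of the actual sequence length, so mask memory stays constant even for 128k-token sequences:

```python
import torch

# Assumed fixed dimension of the compressed causal mask accepted by
# npu_fused_infer_attention_score; it is not resized per request.
COMPRESSED_MASK_DIM = 2048

compressed_mask = torch.triu(
    torch.ones(COMPRESSED_MASK_DIM, COMPRESSED_MASK_DIM), diagonal=1
).to(torch.int8)  # the op currently expects an int8 mask
```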
Could you integrate other scenarios, such as full FlashAttention, using the FIA interface as well, and provide the performance test results?
self.attn_mask_builder = AttentionMaskBuilder(
-    self.model_config.max_model_len, self.dtype)
+    self.model_config.max_model_len, self.dtype, self.device)
If we pass self.model_config.max_model_len here, we still have an attention mask of shape [max_model_len, max_model_len]. We may have two choices:
- Passing self.scheduler_config.max_num_batched_tokens instead of self.model_config.max_model_len.
- Also using the new attention op for the full prefill case.
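For scale, here is a rough back-of-the-envelope comparison between the two sizing options, using assumed values for both limits (128k context length versus an 8k scheduler token budget):

```python
# Back-of-the-envelope sizes for a dense int8 causal mask, with assumed limits.
max_model_len = 128 * 1024           # assumed model context length (128k)
max_num_batched_tokens = 8 * 1024    # assumed scheduler token budget

dense_mask_bytes = max_model_len ** 2              # 16 GiB of int8
budgeted_mask_bytes = max_num_batched_tokens ** 2  # 64 MiB of int8

print(f"{dense_mask_bytes / 2**30:.0f} GiB vs {budgeted_mask_bytes / 2**20:.0f} MiB")
```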
num_kv_heads=self.num_kv_heads,
input_layout="TND",
block_size=block_size,
actual_seq_lengths=attn_metadata.query_start_loc[1:],
Please consider using attn_metadata.query_lens. Also notice another two points:
This pull request has conflicts, please resolve those before we can evaluate the pull request.
num_heads=self.num_heads,
scale_value=self.scale,
out=output)
if self.compressed_mask:
Add a TODO: this op will be used in more situations in the future.
return self.attn_mask_builder.get_splitfuse_attn_mask(
seq_lens, position, self.dtype, self.device)
if selkf.compressed_mas:
return self.attn_mask_builder.get_splitfuse_attn_mask()
code bug
fixed
vllm_ascend/utils.py (Outdated)
raise ValueError("Invalid type for tensors")

def verify_torch_npu_version(
The CANN version needs to be verified, using package.version.
add todo for temp
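A minimal sketch of what such a version gate could look like; the function signature, the use of the packaging library, and the version strings are assumptions for illustration, not the PR's actual implementation:

```python
from packaging import version  # assumed helper library for version parsing

def verify_torch_npu_version(current: str, minimum: str) -> bool:
    """Hypothetical check that the installed torch-npu meets a minimum version."""
    return version.parse(current) >= version.parse(minimum)

# Example: gate enabling the new op on an assumed minimum torch-npu level.
if not verify_torch_npu_version("2.5.1", "2.5.1"):
    raise RuntimeError("torch-npu too old for npu_fused_infer_attention_score")
```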
# so device needs to be passed here.
assigned_mask_dim = 2048
self.chunked_prefill_attn_mask = torch.triu(torch.ones(assigned_mask_dim, assigned_mask_dim), diagonal=1
                                            ).to(torch.int8).to(device)
check type again
This op currently only supports an int8 mask.
self.attn_mask_builder = AttentionMaskBuilder(
-    self.model_config.max_model_len, self.dtype)
+    self.model_config.max_model_len, self.dtype, self.device)
for choice 1:
Suggested change:
- self.model_config.max_model_len, self.dtype, self.device)
+ self.scheduler_config.max_num_batched_tokens, self.dtype, self.device)
Thanks for this work! Just a note: we need to remember to upgrade torch-npu and CANN together in CI later.
What this PR does / why we need it?
Add a new npu_fused_infer_attention_score op to improve performance in splitfuse cases and resolve long-sequence mask problems.
1. The original op's performance is suboptimal in certain scenarios, necessitating optimization through the new op (npu_fused_infer_attention_score).
2. For ultra-long sequences (128k), the original operator allocates a large attn_mask, which consumes excessive CPU memory. In contrast, the new op supports a fixed-size compressed mask, effectively resolving this issue.
NOTE1: The current PR retains the original logic and uses a version check of the CANN package to determine whether the new op can be enabled. This ensures no impact on existing users. In future versions, this version check and the original logic will be deprecated, and the new op scheduling will be uniformly adopted.
NOTE2: This PR relies on a future CANN version, which is not available yet.
NOTE3: To enable the new op in chunked prefill, the parameter additional_config should be set at least like
--additional-config '{"ascend_scheduler_config": {"enabled":true,"enable_chunked_prefill":true}}' \
Does this PR introduce any user-facing change?
No
How was this patch tested?
CI passed